Geostatistics

6. Variograms

This lecture focuses on theoretical variogram functions and on fitting them to experimental data.

  • empirical variograms
  • theoretical functions
  • variogram parameters
  • fitting

6.1 From Covariance to Semi-variance

It is possible to work with spatial covariance functions in geostatistics, as we did in the last lecture. In this lecture, we want to focus on the most common tool of geostatistics, the variogram, which is closely related to the covariance function we built in the last lecture.

The variogram comes with a few advantages over the covariance function.

In the last lecture, we implemented each and every calculation ourselves. This was important to understand what's going on. In this lecture, we will use third-party Python modules to save some time. New concepts will again be introduced with examples showing how to implement the underlying algorithms, so that they can be directly ported to languages other than Python.

First, we need to go back to the Covariance function we used before:

For a set of observations $x$ and another set of observations $x_h$ at a spatial lag $h$, their covariance was defined as:

$$ Cov(x,x_h) = \frac{\sum_{i=1}^N [(x_i - \mu_x)*(x_{h, i} - \mu_{x_h})]}{N} $$

Where $N$ is the length of $x$ and $\mu$ is the respective expected value. That means that when calculating the covariance at a specific lag, the two $\mu$ are constant for all elements in the sum. We can expand the parentheses in the equation above to:

$$ Cov(x,x_h) = \frac{\sum_{i=1}^N (x_i*x_{h,i})}{N} - \mu_x * \mu_{x_h} = Cov(h)$$
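This identity is easy to verify numerically. A minimal sketch with hypothetical toy arrays (using numpy; these values are not part of the lecture data):

```python
import numpy as np

def lag_covariance(x, x_h):
    # expanded form: mean(x * x_h) - mu_x * mu_x_h
    x, x_h = np.asarray(x, dtype=float), np.asarray(x_h, dtype=float)
    return np.mean(x * x_h) - x.mean() * x_h.mean()

# toy observation pairs at some lag h
x = np.array([1.0, 2.0, 3.0, 4.0])
x_h = np.array([1.5, 2.5, 2.0, 4.5])

# both forms yield the same covariance
expanded = lag_covariance(x, x_h)
direct = np.mean((x - x.mean()) * (x_h - x_h.mean()))
```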

If we want the covariance as defined above to be a good estimate for the population, we have to make sure that the expected values are actually a good measure for the sample's distribution.

That's the case for a normal distribution, as all expected values are then just the arithmetic mean values. Thus, we require that:

  • $x$ and $x_h$ are normally distributed
  • Remember that $h$ is a spatial lag, not a location. Thus $x_h$ should be normally distributed at any given lag.

This is called second order stationarity, as the mean and the variance of the variable have to be stationary.

But we run into more assumptions and limitations here. Think of the covariance function at lag 0:

$$ Cov(0) = Cov(x,x) = Var(x) $$

Thus, we require the empirical covariance function to match the variance at lag 0, which is often hard to achieve due to the limited number of point pairs at very close distances.

Further, we are seeking a special lag $r$, at which samples become statistically independent. Theoretical covariance model functions will not map to values $Cov < 0$, but empirical values might well be negative.

Thus, we need a tool that can, in principle, model unbounded, increasing dissimilarity with distance.

That tool is called a semi-variogram. It does not use the covariance, but the semi-variance, and in the case of stationarity it is defined as:

$$ \gamma(h) = \sigma^2 - C(h) $$

The semi-variance can be calculated like:

$$ \gamma(h) = \frac{1}{2N} \sum_{i=1}^N (x_i - x_{h,i})^2 $$
  • $\gamma$ does not involve the $\mu$ anymore
  • We calculate a measure for the pair-wise deviations that becomes larger for bigger differences
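As a sketch, the semi-variance between two lag-separated sets can be computed in a few lines (toy arrays, not the lecture data):

```python
import numpy as np

def semivariance(x, x_h):
    # gamma(h) = 1/(2N) * sum_i (x_i - x_{h,i})^2
    x, x_h = np.asarray(x, dtype=float), np.asarray(x_h, dtype=float)
    return np.sum((x - x_h)**2) / (2 * len(x))
```

Note that no mean value enters the calculation, only the pair-wise differences.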

The semi-variance has a few advantages over using the covariance function:

  • We can relax the assumption of stationarity to requiring only that the pair-wise differences depend on the lag $h$
  • $\gamma(0)$ can be 0 or larger, which fits better to real-world data that is often prone to measurement uncertainty or small-scale variation
  • $\gamma$ may or may not increase unbounded with $h$. This lets us define variogram functions in cases where the covariance function is not defined.

6.2 Empirical variograms

One of the Python modules for geostatistics is scikit-gstat. We will use this module to calculate a variogram, obtaining all the data we derived in the last lecture with much less effort.

For exercises, you can of course also reuse the code from the last lecture.

The main advantage of using the Variogram class of scikit-gstat is that it makes all the intermediate calculation steps available to us.

In [5]:
# using the same settings as in the last lecture
V = skg.Variogram(coords, sample.z, maxlag='median', n_lags=6, normalize=False)
Warning: 'harmonize' is deprecated and will be removedwith the next release. You can add a 'SKG_SUPPRESS' environment variable to suppress this warning.

The 1D distance array is available as V.distance, the 2D distance matrix as V.distance_matrix. The bin edges are available as V.bins and the lag class grouping (1D version) via V.lag_groups(), where a group of -1 marks point pairs beyond the maximum lag.

In [6]:
print(V.bins.round(2))
print(V.distance[:10].round(2))
print(V.lag_groups()[:10])
[ 8.64 17.29 25.93 34.57 43.21 51.86]
[19.65 12.21 15.81 66.04 74.46 22.2  62.1  77.52 35.   30.41]
[ 2  1  1 -1 -1  2 -1 -1  4  3]

The observed values are also available, as well as the pair-wise differences, again in a 1D and a 2D version. But there is a small inconsistency in the class: the actual observations are available as V.values, while the pair-wise differences are at V._diff as 1D and V.value_matrix as 2D.

In [7]:
print(V.value_matrix[:10, :10].round(2))
print('----------------------------')
print(V._diff[:10].round(2))
[[0.   0.04 0.49 0.35 1.49 0.71 0.78 0.71 2.37 0.97]
 [0.04 0.   0.53 0.39 1.45 0.75 0.74 0.75 2.41 0.93]
 [0.49 0.53 0.   0.14 1.98 0.22 1.27 0.22 1.87 1.46]
 [0.35 0.39 0.14 0.   1.84 0.36 1.13 0.36 2.02 1.32]
 [1.49 1.45 1.98 1.84 0.   2.2  0.71 2.2  3.86 0.52]
 [0.71 0.75 0.22 0.36 2.2  0.   1.49 0.   1.66 1.68]
 [0.78 0.74 1.27 1.13 0.71 1.49 0.   1.49 3.14 0.19]
 [0.71 0.75 0.22 0.36 2.2  0.   1.49 0.   1.66 1.68]
 [2.37 2.41 1.87 2.02 3.86 1.66 3.14 1.66 0.   3.34]
 [0.97 0.93 1.46 1.32 0.52 1.68 0.19 1.68 3.34 0.  ]]
----------------------------
[0.04 0.49 0.35 1.49 0.71 0.78 0.71 2.37 0.97 0.05]

With that, we are all set to calculate the semi-variances ourselves:

In [8]:
k = 6
vario = []

for _k in range(k):
    squared_differences = []
    
    for idx, group in enumerate(V.lag_groups()):
        if group == _k:  # right bin
            squared_differences.append(V._diff[idx]**2)
    
    # gamma is sum divided by 2 * length
    vario.append(sum(squared_differences) / (2*len(squared_differences)))
In [10]:
show(variogram)
  • The semi-variance is increasing linearly.
  • While this is a valid empirical variogram, it is entirely possible that we just cut off the point pair formation too early when setting the maximum lag to the median, which is at roughly 50 meters
  • The samples were taken from a 100x100 area, so we can go to 100 as a meaningful maximum lag distance setting.

  • To resolve the larger distance, we should also increase the number of bins

For this to happen, we need to go back and re-define the bins, then re-index the lag class grouping and iterate over all groups to collect the pair-wise differences in the new lag bins. Then, vario can be re-calculated.
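Those manual steps could be sketched as follows, with toy distance and difference arrays standing in for the Variogram internals:

```python
import numpy as np

rng = np.random.default_rng(42)
distance = rng.uniform(0, 100, 500)  # stand-in for V.distance
diff = rng.normal(0, 1, 500)         # stand-in for V._diff

# 1. re-define the bins up to the new maximum lag
n_lags, maxlag = 8, 100
bins = np.linspace(maxlag / n_lags, maxlag, n_lags)

# 2. re-index the lag class grouping (bin index per point pair)
groups = np.digitize(distance, bins)

# 3. collect the pair-wise differences per lag and recalculate gamma
vario = [np.sum(diff[groups == k]**2) / (2 * np.sum(groups == k))
         for k in range(n_lags)]
```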

Or we let the scikit-gstat Variogram class demonstrate when object-oriented programming is a good idea:

In [11]:
V.maxlag = 100
V.n_lags = 8

That's it.

The resulting empirical, or experimental, variogram can be accessed from a property of the same name:

In [12]:
print(V.experimental.round(2))
[0.07 0.26 0.53 0.92 0.95 0.79 0.71 0.76]
In [14]:
show(variogram)

This way, you can check hundreds of fine-tuned values with ease. But it is still necessary to understand what is happening under the hood.

6.3 Theoretical variogram models

Empirical variograms already tell us a lot about spatial structure and an apparent spatial dependency in the data set. Now, we want to describe it more systematically through a mathematical function.

That has a lot of advantages including:

  • we can define models of known mathematical properties
  • one property is that the model is monotonically increasing, which is a prerequisite to use it for interpolation
  • we can compare models in a systematic manner
Most theoretical variogram models can be described by only three parameters:

© gisgeography.com - source: https://gisgeography.com/wp-content/uploads/2016/10/Variogram-Nugget-Range-Sill-425x279.png
  • Sill is the total variance in the dataset, which the semi-variances approach with distance.
  • Range, which I like to call effective range, is the distance at which the sill is reached. For some models, 95% of the sill is used, because the model approaches the sill only asymptotically and never reaches it.
  • Nugget is the y-axis intercept, i.e. the (semi-)variance at a lag distance of 0.

A very helpful diagnostic tool, once the parameters above have been estimated, is the nugget-to-sill ratio.

The nugget is the share of the total variance that you will not be able to model spatially, no matter what. Therefore, this ratio can give you an idea of how much of the variance in the sample the model can explain. Keep in mind that this is a statistical model of the sample, not a feature of the population.
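A minimal sketch of that diagnostic, assuming the sill is given as the total sill (nugget plus partial sill):

```python
def nugget_sill_ratio(nugget, sill):
    # share of the total variance that cannot be modeled spatially
    return nugget / sill

# e.g. a nugget of 0.2 at a total sill of 0.8:
ratio = nugget_sill_ratio(0.2, 0.8)
```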

6.3.1 spherical model

The most widely used geostatistical model is the so-called spherical model. It is defined as:

$$ \gamma(h) = b+ C_0 * \left(\frac{3}{2}*\frac{h}{a} - \frac{1}{2}*\left(\frac{h}{a}\right)^3 \right) $$

if $h \leq a $ and

$$ \gamma(h) = b + C_0$$

otherwise. Here, $C_0$ is the sill, $b$ is the nugget and $a$ is the range parameter of the variogram, which is not in general the effective range $r$. For the spherical model, $a = r$.

In [15]:
def spherical(h, r, C0, b):
    a = 1 * r
    if h <= a:
        return b + C0 * (1.5*(h / a) - 0.5 * (h / a)**3)
    else:
        return b + C0
In [16]:
x = list(range(100))
y = [spherical(_, 45, 13, 2) for _ in x]
In [18]:
show(models)

6.3.2 exponential model

The exponential model is also quite common in the geosciences. It is very similar to the spherical model, but rises much more sharply at short lags and flattens out at larger lags.

$$ \gamma(h) = b + C_0 * \left(1 - exp\left(-\frac{h}{a} \right) \right) $$

with the relation between the effective range $r$ and the range parameter $a$:

$$ a = \frac{r}{3}$$
In [19]:
def exponential(h, r, C0, b):
    a = r / 3.
    return b + C0 * (1 - np.exp(- h / a))
In [20]:
y2 = [exponential(_, 45, 13, 2) for _ in x]
In [22]:
show(models)

6.3.3 Gaussian model

The last of the three most important theoretical models is the Gaussian model. It can describe variables that are quite similar up to a specific lag and then increase drastically in dissimilarity.

$$ \gamma(h) = b + C_0 * \left(1 - exp\left(-\frac{h^2}{a^2} \right) \right)$$

and $$ a = \frac{r}{2} $$

In literature, you will also find $a = \frac{r}{\pi}$ and $a = \frac{r}{3}$.

In [23]:
def gaussian(h, r, C0, b):
    a = r / 2.
    
    return b + C0 * (1 - np.exp(- h**2 / a**2))
In [24]:
y3 = [gaussian(_, 45, 13, 2) for _ in x]
In [26]:
show(models)

6.4 Fitting

One of the most important steps in geostatistics is fitting a theoretical model to the empirical data of a sample. If your model does not represent your data well, you will run into issues during interpolation. No matter how sophisticated or performant the algorithm is, it will fail.

6.4.1 Manual fitting

The easiest, but in my opinion also most effective, way to fit a model to data is to choose the parameters manually. Let's recall the experimental variogram:

In [27]:
show(variogram)

I would set the nugget to nugget=0. I would select the sill around sill=0.76 in order to choose a somewhat longer effective range of range=70. That will fit the first few bins better at the cost of the more distant lags.

In [28]:
r = 70
C0 = 0.76
b = 0

y1 = [spherical(_, r, C0, b) for _ in x]
y2 = [exponential(_, r, C0, b) for _ in x]
y3 = [gaussian(_, r, C0, b) for _ in x]
In [30]:
show(variogram)

Obviously, the Gaussian model describes the first few bins very well, if we accept the remaining bins being fitted less precisely. The spherical model is more or less OK, while the exponential model is quite off and does not describe the data well.

6.4.2 Automatic fitting

The field of automatic fitting is immense, and it is by no means bound to geostatistics.

What we are actually trying to achieve is a regression: we want to fit a function to an independent variable to predict the dependent variable as well as possible. There are hundreds of methods out there, from very specialized to very generalized ones.

We will implement a least squares algorithm, because it is easy to implement and easy to understand, while still being fairly performant for our use case.

We will implement a brute force version thereof.

The basic idea is to find the optimal parameter set from a given parameter space, to predict the dependent variable with as little error as possible.

Or in other words: we need an objective function that gives us the error of the model function when the actual value is known.

I like the term loss function as well, which is more common in the machine learning world.

Now, with that function, the optimization problem can be solved by minimizing the loss function. We use the combination of range and sill that yields the smallest root mean squared error (RMSE).

$$ RMSE = \sqrt{\frac{\sum_{i=1}^N (y_i^* - y_i)^2}{N}} $$

where $y_i$ is the observation and $y_i^*$ the prediction at the same value of $x$.

In [31]:
def loss(func, bins, gamma, _range, _sill):
    mod = [func(h, _range, _sill, 0) for h in bins]
    dev = [(m - o)**2 for m,o in zip(mod, gamma)]
    
    return np.sqrt(sum(dev) / len(dev))

There are a few things to consider:

  • func is the function we try to fit: spherical, exponential or gaussian in our case
  • bins and gamma would better be called x and y in a general least squares implementation
  • we will not fit the nugget, as we already decided on nugget=0.

Least squares will of course work with any number of parameters (in theory).
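As a sketch, a grid over three parameters (range, sill and nugget) could be built just like the two-parameter grid used below, e.g. with itertools.product; the step counts here are purely hypothetical:

```python
from itertools import product

# hypothetical coarse grids for range, sill and nugget
r_test = list(range(0, 100, 10))
s_test = [s / 10 for s in range(10)]
b_test = [b / 10 for b in range(5)]

# all combinations of the three parameters
params = list(product(r_test, s_test, b_test))
```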

The only missing piece now are the parameters to plug in. Which values should we use?

This is where brute force sets in: we will simply try all values.

That's in most cases not really a clever idea. But in the case of variogram fitting, we can dramatically limit the parameter space:

  • range, sill and nugget cannot be < 0, as $\gamma(h) \geq 0$ by definition
  • Our observations have a spatial extent, and usually we used a maximum lag (100 in our case). This is a natural upper limit for the range.
  • For the sill, we can use the sample variance as an upper limit. Think about it: as long as we assume that the sample is somehow representative for the population, predicting a semi-variance much larger than the total sample variance does not really make sense. We can, for example, use $2 \cdot Var$; anything much larger would just be an effect of small sample sizes in the respective bins.

Finally, the step size between two parameter values can be limited as well. If we bin the distances e.g. like:

In [32]:
V.bins.round(1)
Out[32]:
array([ 12.5,  25. ,  37.5,  50. ,  62.5,  75. ,  87.5, 100. ])

why would we test the range parameters 16.01, 16.02, 16.03 and so on?

We can easily split both parameter ranges into 100 steps:

In [33]:
# 0 <= range <= 100
r_test = list(range(100))    

# 0 <= sill <= sample variance
s_test = [(_ / 100) * 2*sample.z.var() for _ in range(100)]

# parameter 'mesh'-grid
params = []
for _r in r_test:
    for _s in s_test:
        params.append((_r, _s, ))
print('Testing %d parameter combinations' % len(params))
Testing 10000 parameter combinations

10000 combinations is something that our computer can easily handle. Time to brute force!

In [34]:
result = []

t1 = time()
for _range in r_test:
    r_result = []
    for _sill in s_test:
        r_result.append(loss(gaussian, V.bins, V.experimental, _range, _sill))
    result.append(r_result)
t2 = time()
print('Fitting took %.3f seconds' % (t2 - t1))
/usr/share/miniconda/lib/python3.7/site-packages/ipykernel_launcher.py:4: RuntimeWarning: divide by zero encountered in double_scalars
  after removing the cwd from sys.path.
Fitting took 1.576 seconds
In [36]:
show(optim)

We can see a few things here:

  • To minimize, we need the smallest RMSE, represented by the green colors

  • The small sill values yield such bad results that the range didn't matter. That's expected.

  • The effect of the range is only visible for a specific range of sill values

  • Our manual fitting with range=70 and sill=0.76 was already close to the optimum.

Next, we can extract the parameter set with the smallest RMSE:

In [37]:
# we store the smallest rmse and its index
low_index = (0,0)
low_rmse = result[0][0]

for i in range(len(r_test)):
    for j in range(len(s_test)):
        if result[i][j] < low_rmse:
            low_index = (i,j)
            low_rmse = result[i][j]

# extract parameters
opt_range = r_test[low_index[0]]
opt_sill = s_test[low_index[1]]
print('Best Parameter set: range=%.0f  sill=%.2f  [RMSE: %.3f]' % (opt_range, opt_sill, low_rmse))
Best Parameter set: range=66  sill=0.82  [RMSE: 0.106]

Apply the Gaussian model, like we did before:

In [38]:
y_opt = [gaussian(_, opt_range, opt_sill, 0) for _ in x]
In [40]:
show(fit_compare)
  • Bins 4 & 5 give the manual fit a slightly worse RMSE than the least squares fit
  • The manual fit hits the first three bins better (as we fitted it that way)

If we consider the best 5% of all tested parameter sets to be valid solutions to the minimization problem, we can plot all their variograms into the figure as well, to get a better idea of the spread.
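Selecting the best 5% could be sketched like this, with a random toy RMSE grid standing in for the brute-force result grid:

```python
import numpy as np

# toy RMSE surface standing in for the brute-force `result` grid
rng = np.random.default_rng(0)
result = rng.uniform(0.1, 1.0, size=(100, 100))

# RMSE threshold below which a parameter set counts as 'best 5%'
threshold = np.quantile(result, 0.05)

# (range-index, sill-index) pairs of all accepted parameter sets
best_idx = np.argwhere(result <= threshold)
```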

In [42]:
show(fit_compare)

6.4.3 Adjusting the fit

  • The comparison of our manual fit to the least squares fit was not really fair
  • We accepted the 4th and 5th bins being a bit off
  • and therefore put the focus on only the first three bins

We can implement this in the least squares approach as well. To illustrate the procedure, we will simply give the first three bins triple weight.

In [43]:
def loss_adjust(func, bins, gamma, _range, _sill):
    mod = [func(h, _range, _sill, 0) for h in bins]
    dev = [(m - o)**2 for m,o in zip(mod, gamma)]
    # give triple weight to the first three bins
    dev[:3] = [_ * 3. for _ in dev[:3]]
    
    return np.sqrt(sum(dev) / len(dev))
In [44]:
result_adj = []

t1 = time()
for _range in r_test:
    r_result = []
    for _sill in s_test:
        r_result.append(loss_adjust(gaussian, V.bins, V.experimental, _range, _sill))
    result_adj.append(r_result)
t2 = time()
print('Fitting took %.3f seconds' % (t2 - t1))
/usr/share/miniconda/lib/python3.7/site-packages/ipykernel_launcher.py:4: RuntimeWarning: divide by zero encountered in double_scalars
  after removing the cwd from sys.path.
Fitting took 1.529 seconds
In [46]:
show(row(optim, optim_adj))

While the RMSE values drop slightly, which is expected, the green area narrows down. This makes it easier to select parameter sets from the parameter space.

In [48]:
show(column(fit_adj, fit_compare))

We can see a few things here:

  • The adjusted red model in the upper plot predicts the first three observations much better
  • and it only differs in the sill from our manual fit
  • The grey envelope (best 5% of parameters) is much narrower at the first bins and a bit wider at the last
  • These models will introduce less error at small lags at the cost of higher error at larger lags